Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang) by fcostaoliveira · Pull Request #382 · fastfloat/fast_float

fcostaoliveira · 2026-06-01T11:20:56Z

After the 8-digit SWAR block loop in loop_parse_if_eight_digits, a remaining
run of 4–7 digits is currently finished one byte at a time by the caller. This adds
a single 4-digit SWAR step (reusing the existing read4_to_u32 /
is_made_of_four_digits_fast / parse_four_digits_unrolled helpers, already used
elsewhere in the file) to consume four of those digits at once. The parsed result is
identical — it is purely a faster way to consume the same digits.

Benchmark — m8g.metal-24xl (Graviton4 / Neoverse V2), -O3 -march=native,
simple_fastfloat_benchmark, from_chars→double, clang 18, base vs patch
measured back-to-back (reproduced across 3 runs):

dataset	clang 18
canada.txt	+11.7%
mesh.txt	+7.7%
random [0,1]	~0%

About the `#if defined(clang)` gate

I gated this to clang, and I want to be upfront about why rather than hide it. On
gcc the extra 4-digit check regresses inputs whose remainder is shorter than 4
digits (e.g. the 17-digit fraction of uniform [0,1]: gcc spends the check but never
takes it, ~−3% on random), while clang shows no such regression and a large gain on
the mid-length fractions in canada/mesh. Gating keeps the clang win with zero
regression on either compiler, but it does introduce a compiler-specific branch.

I'm happy to do whatever you prefer here: drop the gate (accepting the gcc random
regression), keep it gated, rework it into a form that helps both, or drop the PR if
you'd rather not carry a compiler-specific path. Numbers and reasoning are all here so
you can decide.

Correctness

FASTFLOAT_TEST (14/14) + supplemental corpus + the randomized random_string /
short_random_string tests pass under clang; an exhaustive float32 sweep
(FASTFLOAT_EXHAUSTIVE) is running. Clean C++11/C++20 under
-Werror -Wall -Wextra -Weffc++ -Wconversion -Wsign-conversion; clang-format clean.
No multi-byte reads beyond the existing endian-safe read4_to_u32, so big-endian is
unaffected.

After the 8-digit SWAR block loop, consume a remaining 4-7 digit run in one read4_to_u32 + parse_four_digits_unrolled step instead of byte-by-byte (reusing the existing 4-digit helpers). The parsed result is identical; this is purely a faster way to consume the same digits. Gated to clang: on gcc the extra 4-digit check regresses inputs whose remainder is < 4 digits (e.g. the 17-digit fraction of uniform [0,1] -> -3% on 'random'), because the check becomes pure overhead there; clang does not show that. m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark, from_chars->double, clang 18, base vs patch back-to-back (2 samples): canada.txt +11.7%, mesh.txt +7.4%, random ~flat. No regression.

fcostaoliveira · 2026-06-01T12:09:36Z

Exhaustive validation finished: all five float32 sweeps pass under clang against this branch — exhaustive32, exhaustive32_64, exhaustive32_midpoint, long_exhaustive32, long_exhaustive32_64 (5/5, 0 failures).

lemire

Let us merge once the tests are green.

lemire approved these changes Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)#382

Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)#382
fcostaoliveira wants to merge 1 commit into
fastfloat:mainfrom
redis-performance:pr/four-digit-followup

fcostaoliveira commented Jun 1, 2026

Uh oh!

fcostaoliveira commented Jun 1, 2026

Uh oh!

lemire left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

fcostaoliveira commented Jun 1, 2026

About the #if defined(__clang__) gate

Correctness

Uh oh!

fcostaoliveira commented Jun 1, 2026

Uh oh!

lemire left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

About the `#if defined(clang)` gate